Existence and Finiteness Conditions for Risk-Sensitive Planning: First Results
Abstract
Decision-theoretic planning with risk-sensitive planning objectives is important for building autonomous agents or decision-support agents for real-world applications. However, this line of research has been largely ignored in the artificial intelligence and operations research communities since planning with risk-sensitive planning objectives is much more complex than planning with risk-neutral planning objectives. To remedy this situation, we develop conditions that guarantee the existence and finiteness of the expected utilities of the total plan-execution reward for risk-sensitive planning with totally observable Markov decision process models. In the case of Markov decision process models with both positive and negative rewards, our results hold for stationary policies only, but we conjecture that they can be generalized to hold for all policies.

Introduction

Decision-theoretic planning is important since real-world applications need to cope with uncertainty. Many decision-theoretic planners use totally observable Markov decision process (MDP) models from operations research (Puterman 1994) to represent planning problems under uncertainty. However, most of them minimize the expected total plan-execution cost or, synonymously, maximize the expected total reward (MER). This planning objective and similar simplistic planning objectives often do not take the preferences of human decision makers sufficiently into account, for example, their risk attitudes in planning domains with huge wins or losses of money, equipment or human life. This means that they are not well suited for real-world planning domains such as space applications (Zilberstein et al. 2002), environmental applications (Blythe 1997), and business applications (Goodwin, Akkiraju, & Wu 2002). In this paper, we provide a first step towards a comprehensive foundation of risk-sensitive planning. In particular, we develop sets of conditions that guarantee the existence and finiteness of the expected utilities when maximizing the expected utility (MEU) of the total reward for risk-sensitive planning with totally observable Markov decision process models and non-linear utility functions.

Risk Attitudes and Utility Theory

Human decision makers are typically risk-sensitive and thus do not maximize the expected total reward in planning domains with huge wins or losses. Table 1 shows an example for which many human decision makers prefer Choice 2 over Choice 1 even though its expected total reward is lower. They are risk-averse and thus accept a reduction in expected total reward for a reduction in variance.

Table 1: An Example of Risk Sensitivity

              Probability   Reward        Expected Reward   Utility   Expected Utility
  Choice 1    50%           $10,000,000   $5,000,000        −0.050    −0.525
              50%           $0                              −1.000
  Choice 2    100%          $4,500,000    $4,500,000        −0.260    −0.260

Utility theory (von Neumann & Morgenstern 1944) suggests that this behavior is rational because human decision makers maximize the expected utility of the total reward. Utility functions map total rewards to the corresponding finite utilities and capture the risk attitudes of human decision makers (Pratt 1964). They are strictly monotonically increasing in the total reward. Linear utility functions characterize risk-neutral human decision makers, while non-linear utility functions characterize risk-sensitive human decision makers.
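As a quick check of Table 1 (the utility values themselves come from the concave utility function introduced below), the expected utility of each choice is simply the probability-weighted average of the utilities of its possible total rewards:

```latex
\begin{align*}
E[U(\text{Choice 1})] &= 0.5\,(-0.050) + 0.5\,(-1.000) = -0.525,\\
E[U(\text{Choice 2})] &= 1.0\,(-0.260) = -0.260.
\end{align*}
```

Since −0.260 > −0.525, a decision maker with these utilities prefers Choice 2, even though its expected total reward ($4,500,000) is lower than that of Choice 1 ($5,000,000).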
In particular, concave utility functions characterize risk-averse human decision makers (“insurance holders”), and convex utility functions characterize risk-seeking human decision makers (“lottery players”). For example, if a risk-averse human decision maker has the concave exponential utility function U(w) = −0.9999997^w, where w is the total reward, and thus associates the utilities shown in Table 1 with the total rewards of the two choices, then Choice 2 maximizes their expected utility and should thus be chosen by them. On the other hand, MER planners choose Choice 1, and the human decision maker would thus be extremely unhappy with them with 50 percent probability.

Markov Decision Process Models

We study decision-theoretic planners that use MDPs to represent probabilistic planning problems. Formally, an MDP is a 4-tuple (S, A, P, r) of a state space S, an action space A, a set of transition probabilities P, and a set of finite (immediate) rewards r. If an agent executes action a ∈ A in state s ∈ S, then it incurs reward r(s, a, s') and transitions to state s' ∈ S with probability P(s'|s, a). An MDP is called finite if its state space and action space are both finite. We assume throughout this paper that the MDPs are finite since decision-theoretic planners typically use finite MDPs.

The number of time steps that a decision-theoretic planner plans for is called its (planning) horizon. A history at time step t is the sequence h_t = (s_0, a_0, ..., s_{t-1}, a_{t-1}, s_t) of states and actions from the initial state to the current state. The set of all histories at time step t is H_t = (S × A)^t × S. A trajectory is an element of H_∞ for infinite horizons and of H_T for finite horizons, where T ≥ 1 denotes the last time step of the finite horizon.

Decision-theoretic planners determine a decision rule for every time step within the horizon. A decision rule determines which action the agent should execute in its current state. A deterministic history-dependent (HD) decision rule at time step t is a mapping d_t : H_t → A. A randomized history-dependent (HR) decision rule at time step t is a mapping d_t : H_t → P(A), where P(A) is the set of probability distributions over A. Markovian decision rules are history-dependent decision rules whose actions depend only on the current state rather than the complete history at the current time step. A deterministic Markovian (MD) decision rule at time step t is a mapping d_t : S → A. A randomized Markovian (MR) decision rule at time step t is a mapping d_t : S → P(A).

A policy π is a sequence of decision rules d_t, one for every time step t within the horizon. We use Π^K to denote the set of all policies whose decision rules all belong to the same class K, where K ∈ {HR, HD, MR, MD}. The set of all possible policies Π is the same as Π^HR. Decision-theoretic planners typically determine stationary policies. A Markovian policy π is stationary if d_t = d for all time steps t, and we then write π(s) = d(s). We use Π^SD to denote the set of all deterministic stationary (SD) policies and Π^SR to denote the set of all randomized stationary (SR) policies.

The state transitions resulting from stationary policies are determined by Markov chains. A state of a Markov chain, and thus also a state of an MDP under a stationary policy, is called recurrent iff the expected number of time steps between visits to the state is finite. A recurrent class is a maximal set of states that are recurrent and reachable from each other. These concepts play an important role in the proofs of our results.
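To make these definitions concrete, below is a minimal sketch, not taken from the paper, of a hypothetical finite MDP (S, A, P, r), a deterministic stationary policy, the Markov chain the policy induces, its recurrent classes, and a Monte Carlo estimate of the expected utility of the total reward under a concave exponential utility function. All numbers, state/action indices, the discount base gamma = 0.9, and the function names are illustrative assumptions.

```python
# Sketch of the section's definitions on a made-up 3-state, 2-action MDP.
import numpy as np

rng = np.random.default_rng(0)

# P[a, s, s'] = P(s'|s, a); r[a, s, s'] = r(s, a, s').  (Hypothetical values.)
P = np.array([
    [[0.9, 0.1, 0.0],   # action 0
     [0.0, 0.8, 0.2],
     [0.0, 0.0, 1.0]],
    [[0.2, 0.8, 0.0],   # action 1
     [0.1, 0.0, 0.9],
     [0.0, 0.0, 1.0]],
])
r = np.array([
    [[1.0, 2.0, 0.0],   # action 0
     [0.0, 1.0, -1.0],
     [0.0, 0.0, 0.0]],
    [[0.5, 3.0, 0.0],   # action 1
     [1.0, 0.0, -2.0],
     [0.0, 0.0, 0.0]],
])
num_states = P.shape[1]

# A deterministic stationary (SD) policy maps each state to a single action.
pi = np.array([1, 0, 0])

# Markov chain induced by the stationary policy: P_pi[s, s'] = P(s'|s, pi(s)).
P_pi = np.array([P[pi[s], s] for s in range(num_states)])

def recurrent_classes(chain, tol=1e-12):
    """Recurrent classes of a finite Markov chain: the communicating classes
    that no transition leaves (bottom strongly connected components)."""
    n = chain.shape[0]
    reach = ((chain > tol) | np.eye(n, dtype=bool)).astype(float)
    for _ in range(n):                       # repeated squaring -> transitive closure
        reach = (reach @ reach > 0).astype(float)
    reach = reach > 0
    classes, seen = [], set()
    for s in range(n):
        if s in seen:
            continue
        comm = {t for t in range(n) if reach[s, t] and reach[t, s]}
        seen |= comm
        leaves = any(reach[t, u] for t in comm for u in range(n) if u not in comm)
        if not leaves:
            classes.append(sorted(comm))
    return classes

print("recurrent classes under pi:", recurrent_classes(P_pi))  # here: [[2]]

# Concave exponential utility U(w) = -gamma**w (same family as the
# -0.9999997**w of the running example; gamma = 0.9 is made up).
gamma = 0.9
def U(w):
    return -(gamma ** w)

def estimate_expected_utility(T=20, episodes=10_000, s0=0):
    """Monte Carlo estimate of E[U(total reward)] over a finite horizon T,
    starting in state s0 and following the stationary policy pi."""
    utilities = []
    for _ in range(episodes):
        s, total = s0, 0.0
        for _ in range(T):
            a = pi[s]
            s_next = rng.choice(num_states, p=P[a, s])
            total += r[a, s, s_next]
            s = s_next
        utilities.append(U(total))
    return float(np.mean(utilities))

print("estimated expected utility:", estimate_expected_utility())
```

Whether such an expected utility remains well defined and finite as the horizon grows, for MDPs with both positive and negative rewards, is exactly the existence and finiteness question the paper studies.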